Abstract: Web page classification is the process of assigning documents to predefined categories based on their
content. Many algorithms have been developed to deal with automatic text classification. The most common
techniques used for this purpose include the Apriori algorithm and the Naive Bayes classifier. The Apriori
algorithm finds interesting association or correlation relationships among a large set of data items. The
discovery of these relationships among huge amounts of transaction records can help in many decision-making
processes. The Naive Bayes classifier uses maximum a posteriori estimation for learning a classifier. In this
work, the Naive Bayes classifier is used to calculate the probability of keywords among large data itemsets.
The technique is efficient for web page classification. It will be more effective if the training set is
chosen in such a way that it generates more word sets. Though the experimental results are quite encouraging,
it would be better to repeat the work with larger data sets and more classes.

I. Introduction

There are numerous web pages available in electronic
form. Such documents represent a massive amount of
information that is easily accessible. Seeking value in this
huge collection requires organization; much of the work of
organizing documents can be automated through data
mining [4] [11]. The accuracy of such systems, and our
understanding of them, greatly influence their usefulness.
The task of web mining is to automatically classify
documents into predefined classes based on their content.
Many algorithms have been developed to deal with automatic
text classification [5]. The most common techniques used
for this purpose include the Apriori algorithm [5] [6] [8]
and the Naive Bayes classifier [5] [6].

The Apriori algorithm [5] [6] finds interesting
association [2] [8] or correlation relationships among a
large set of data items. The discovery of these
relationships among huge amounts of transaction records
can help in many decision-making processes. On the other
hand, the Naive Bayes classifier [5] [6] uses maximum
a posteriori estimation for learning a classifier. The
Naive Bayes classifier [5] [6] is then used to calculate the
probability of keywords among large data itemsets.

II. Background Study 

A. Data Cleaning [7] [11] 

Data cleaning [7], also called data cleansing or
scrubbing, deals with detecting and removing errors and
inconsistencies from data in order to improve the quality
of data. Data quality problems are present in single data
collections, such as files and databases, e.g., due to
misspellings during data entry, missing information or
other invalid data. When multiple data sources need to be
integrated, e.g., in data warehouses, federated database
systems or global web-based information systems, the
need for data cleaning increases significantly.

Most natural languages have so-called function words and
connectives, such as articles and prepositions, that appear in
a large number of documents and are typically of little use
in pinpointing documents that satisfy a searcher's
information need. Such words (e.g., a, an, the, on in
English) are called stop words.

Stop words are words which do not carry significant
information, or which occur so often in text that
they lose their usefulness.

The following steps are used to remove noisy data from each web page:

• First, the noisy data is removed by using an
application.

• Each document used for training is considered as a
transaction in the text data.

• The text data is cleaned by removing
unnecessary words or noisy data, i.e., the text data is
filtered and subject-related words are collected.

Therefore, a useful pre-processing step is to run the data
through some data cleaning [11] routines.
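
The stop-word filtering described above can be sketched in a few lines of Python. This is only a minimal illustration: the stop-word list, the function name `clean_text`, and the sample sentence are assumptions made for the example, since the application actually used for cleaning is not specified here.

```python
import re

# Illustrative stop-word list (an assumption; the list used in this work is not given).
STOP_WORDS = {"a", "an", "the", "on", "am", "is", "are", "to", "from", "in", "of", "and"}

def clean_text(text):
    """Lower-case the text, drop punctuation and stop words, and return the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

sample = "The bowler is on the pitch, and the team scored a run."
print(clean_text(sample))  # ['bowler', 'pitch', 'team', 'scored', 'run']
```

A real cleaning routine would typically use a much longer stop-word list and may also merge singular and plural forms.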
B. Association Rule [12]

Association rules are an important class of regularities
in data. Mining of association rules is a fundamental data
mining task. It is perhaps the most important model
invented and extensively studied by the database and data
mining community. Its objective is to find all
co-occurrence relationships, called associations, among data
items.

The left-hand side of an association rule is called the
antecedent, and the right-hand side is the consequent. In
the rule Cheese → Beer, Cheese is the antecedent and
Beer is the consequent.

The classic application of association rule mining is
market basket data analysis, which aims to discover how
items purchased by customers in a supermarket (or a store)
are associated. An example association rule is

Cheese → Beer [support = 10%, confidence = 80%]

The rule says that 10% of customers buy Cheese and Beer
together, and those who buy Cheese also buy Beer 80% of
the time.
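
To make the two percentages attached to such a rule concrete, the following Python sketch computes them from a small set of baskets. The baskets and the helper names `support` and `confidence` are made up for illustration; they do not reproduce the 10% and 80% figures of the example rule.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical baskets, made up for illustration.
baskets = [
    {"cheese", "beer", "bread"},
    {"cheese", "beer"},
    {"cheese", "milk"},
    {"bread", "milk"},
    {"beer", "bread"},
]
rule_support = support(baskets, {"cheese", "beer"})          # 2 of the 5 baskets
rule_confidence = confidence(baskets, {"cheese"}, {"beer"})  # 2 of the 3 cheese baskets
print(rule_support, rule_confidence)
```

Note that `confidence` assumes the antecedent occurs in at least one transaction; otherwise the division is undefined.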

The problem of mining association rules can be stated as
follows: Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items. Let
$T = (t_1, t_2, \ldots, t_n)$ be a set of transactions (the database), where
each transaction $t_i$ is a set of items such that $t_i \subseteq I$.
An association rule is an implication of the form

$X \rightarrow Y$, where $X \subset I$, $Y \subset I$, and $X \cap Y = \emptyset$.

X (or Y) is a set of items, called an itemset.

Support: The support is the ratio (or percentage) of the
number of transactions satisfying both antecedent and
consequent to the total number of transactions [9]. The
support of a rule, $X \rightarrow Y$, is the percentage of transactions
in T that contain $X \cup Y$, and can be seen as an estimate of
the probability $\Pr(X \cup Y)$. The rule support thus
determines how frequently the rule is applicable in the
transaction set T. Let n be the number of transactions in T.
The support of the rule $X \rightarrow Y$ is computed as follows:

$$\text{support} = \frac{(X \cup Y).\text{count}}{n} \qquad (1)$$

Support is a useful measure because if it is too low, the
rule may just occur due to chance. Furthermore, in a
business environment, a rule covering too few cases (or
transactions) may not be useful because it does not make
business sense to act on such a rule (not profitable).

Confidence: Confidence (strength or evidence) is derived
from the subset of transactions in which two entities (or
activities) are related [9]. The confidence of a rule, $X \rightarrow Y$,
is the percentage of transactions in T containing X that also
contain Y. It can be seen as an estimate of the conditional
probability $\Pr(Y \mid X)$. It is computed as follows:

$$\text{confidence} = \frac{(X \cup Y).\text{count}}{X.\text{count}} \qquad (2)$$

Confidence thus determines the predictability of the rule.
If the confidence of a rule is too low, one cannot reliably
infer or predict Y from X. A rule with low predictability is
of limited use.

C. Apriori Algorithm [5][6][8]

Apriori [5] [6] [8] is an influential algorithm for mining
frequent itemsets for Boolean association rules using
candidate generation. The name of the algorithm is based on
the fact that the algorithm uses prior knowledge of
frequent itemset properties.

Apriori [5] [6] [8] employs an iterative approach known as
a level-wise search, where $k$-itemsets are used to explore
$(k+1)$-itemsets. First, the set of frequent 1-itemsets is
found. This set is denoted by $L_1$. $L_1$ is used to find $L_2$, the
set of frequent 2-itemsets, which is used to find $L_3$, and so
on, until no more frequent $k$-itemsets can be found. The
finding of each $L_k$ requires one full scan of the database.

Fig. 1 Example of Apriori Algorithm

• In the first iteration of the algorithm, each item is a
member of the set of candidate 1-itemsets, $C_1$. The
algorithm simply scans all of the transactions in order
to count the number of occurrences of each item.
Suppose that the minimum transaction support count
required is 2 (i.e., min_sup = 2/5 = 40%). The set of
frequent 1-itemsets, $L_1$, can then be determined. It
consists of the candidate 1-itemsets satisfying
minimum support.

• To discover the set of frequent 2-itemsets, $L_2$, the
algorithm uses the join $L_1 \bowtie L_1$ to generate a candidate
set of 2-itemsets, $C_2$.

• The transactions in D are scanned and the support
count of each candidate itemset in $C_2$ is accumulated.
The set of frequent 2-itemsets, $L_2$, is then determined,
consisting of those candidate 2-itemsets in $C_2$ having
minimum support.

• The generation of the set of candidate 3-itemsets, $C_3$,
is observed in step 7 to step 8. Here $C_3 = L_2 \bowtie L_2$ =
{{1, 2, 3}, {1, 2, 5}, {1, 3, 5}, {2, 3, 5}}. Based on the
Apriori property that all subsets of a frequent itemset
must also be frequent, any candidate that has an
infrequent subset cannot possibly be frequent and is
removed from $C_3$.

• The transactions in D are scanned in order to
determine $L_3$, consisting of those candidate
3-itemsets in $C_3$ having minimum support. The
algorithm uses $L_3 \bowtie L_3$ to generate a candidate set of
4-itemsets, $C_4$. Although the join results in {{1, 2, 3,
5}}, this itemset is pruned since its subset {2, 3, 5}
is not frequent. Thus, $C_4 = \emptyset$, and the algorithm
terminates.

To understand how the Apriori [5] [6] [8] property is used in
the algorithm, let us look at how $L_{k-1}$ is used to find $L_k$. A
two-step process is followed, consisting of join and prune
actions:

i. The Join Step:

To find $L_k$, a set of candidate $k$-itemsets is generated
by joining $L_{k-1}$ with itself. This set of candidates is denoted
by $C_k$. Let $l_1$ and $l_2$ be itemsets in $L_{k-1}$. Then $l_1$ and $l_2$ are
joinable if their first $(k-2)$ items are in common, i.e.,

$$(l_1[1] = l_2[1]) \wedge (l_1[2] = l_2[2]) \wedge \cdots \wedge (l_1[k-2] = l_2[k-2]) \wedge (l_1[k-1] < l_2[k-1]).$$

ii. The Prune Step:

$C_k$ is a superset of $L_k$. A scan of the database to
determine the count of each candidate in $C_k$ would result in
the determination of $L_k$ (the itemsets in $C_k$ having a count
no less than the minimum support). But this scan and
computation can be reduced by applying the Apriori
property. Any $(k-1)$-itemset that is not frequent cannot be
a subset of a frequent $k$-itemset. Hence if any $(k-1)$-subset
of a candidate $k$-itemset is not in $L_{k-1}$, then the candidate
cannot be frequent either and so can be removed from $C_k$.
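
The join and prune steps can be sketched together in Python as follows. This is a minimal illustration rather than the implementation used in this work: itemsets are assumed to be sorted tuples, the function name `apriori_gen` simply mirrors the name used in the pseudocode, and the input list of frequent 2-itemsets is hypothetical.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build candidate k-itemsets from frequent (k-1)-itemsets given as sorted tuples."""
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    for i, l1 in enumerate(prev):
        for l2 in prev[i + 1:]:
            # Join step: the first k-2 items agree and the last item of l2 is larger.
            if l1[:-1] == l2[:-1] and l2[-1] > l1[-1]:
                cand = l1 + (l2[-1],)
                # Prune step: every (k-1)-subset of the candidate must itself be frequent.
                if all(sub in prev_set for sub in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
    return candidates

# Hypothetical frequent 2-itemsets, chosen only for illustration.
frequent_2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5)]
print(apriori_gen(frequent_2))  # [(1, 2, 3), (1, 2, 5)]
```

In this example the joins also produce (1, 3, 5) and (2, 3, 5), but both are pruned because their subset (3, 5) is not among the frequent 2-itemsets.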

The algorithm is as follows:

Input: Database, D; minimum support threshold,
min_sup.

Output: L, frequent itemsets in D.

(1) L_1 = find_frequent_1-itemsets(D);
(2) for (k = 2; L_{k-1} ≠ ∅; k++)
(3) {
(4)     C_k = apriori_gen(L_{k-1}, min_sup);
(5)     for each transaction t ∈ D        // scan D for counts
(6)     {
(7)         C_t = subset(C_k, t);         // get the subsets of t that are candidates
(8)         for each candidate c ∈ C_t
(9)             c.count++;
(10)    }
(11)    L_k = {c ∈ C_k | c.count ≥ min_sup}
(12) }
(13) return L = ∪_k L_k;
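
A compact Python sketch of the level-wise loop is given below. It is an illustration under stated assumptions, not the code used in the experiments: the minimum support is given as an absolute count, candidates are formed as unions of two frequent itemsets (a simpler stand-in for the ordered join) and then pruned with the Apriori property, and the transaction database `D` is hypothetical and is not the one shown in Fig. 1.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset (frozenset): count} for all itemsets with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    # First scan: count single items to obtain the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Candidate generation: unions of two frequent (k-1)-itemsets of size k ...
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # ... kept only if every (k-1)-subset is frequent (Apriori property).
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # One scan of the database to count the surviving candidates.
        counts = {c: sum(1 for t in transactions if c.issubset(t)) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical transaction database, made up for illustration.
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 3}]
result = apriori(D, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

With five transactions, `min_count=2` corresponds to a minimum support of 40%. On this toy database the only frequent 3-itemset is {2, 3, 5}, after which no 4-itemset candidates remain and the loop stops.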



Apriori [5] [6] achieves good performance by reducing
the size of candidate sets. However, in situations with very
many frequent itemsets, large itemsets, or very low
minimum support, it still suffers from the cost of
generating a huge number of candidate sets and scanning
the database repeatedly to check a large set of candidate
itemsets.

D. Naive Bayes Classifier [5] [6] [8] [9] 

Bayesian classification [5] [6] [2] is based on Bayes'
theorem. A simple Bayesian classifier, namely the
Naive Bayes classifier [5] [6] [8], is comparable in performance
with decision tree [1][3] and neural network classifiers.
Bayesian classifiers [5] [6] [2] have also exhibited high
accuracy and speed when applied to large databases.

When applying the Naive Bayes classifier [5] [6] [8] to
classify text, each word position in a document is defined
as an attribute, and the value of that attribute is the word
found in that position. The classifier is given by

$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j) \qquad (3)$$

The first term, $P(v_j)$, is calculated as the fraction of each
target class in the training data. The second term of the
equation is calculated by the following equation, after
adopting the m-estimate approach in order to avoid zero
probability values, where $w_k$ is the word set occurring as
the attribute value $a_i$:

$$P(w_k \mid v_j) = \frac{n_k + 1}{n + |\text{vocabulary}|} \qquad (4)$$

where n = total number of word set positions in all training
examples whose target value is $v_j$, $n_k$ = number of times the
word set is found among all the training examples whose
target value is $v_j$, and |vocabulary| = the total number of
distinct word sets found within all the training data.

What if a probability value of zero is encountered? There is
a simple trick to avoid this problem. We can assume that our
training database, D, is so large that adding one to each
count that we need would only make a negligible difference
in the estimated probability value, yet would conveniently
avoid the case of probability values of zero. This
technique is known as the Laplacian correction or Laplace
estimator, named after Pierre Laplace, a French mathematician
who lived from 1749 to 1827. If there are q counts to each of
which we add one, then we must remember to add q to the
corresponding denominator used in the probability
calculation [9].
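
The following Python sketch shows how the classifier and the add-one correction fit together. It is a minimal illustration, not the implementation used in this work: it works on single words rather than word sets, sums logarithms instead of multiplying probabilities, skips words never seen in training, and uses made-up training data and hypothetical function names.

```python
from collections import Counter
import math

def train(docs):
    """docs: list of (list_of_words, label). Returns priors, per-class word counts, vocabulary."""
    class_docs = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_docs}
    for words, label in docs:
        word_counts[label].update(words)
    vocab = {w for words, _ in docs for w in words}
    priors = {label: n / len(docs) for label, n in class_docs.items()}
    return priors, word_counts, vocab

def word_prob(word, label, word_counts, vocab):
    """Laplace-corrected estimate (n_k + 1) / (n + |vocabulary|)."""
    n_k = word_counts[label][word]
    n = sum(word_counts[label].values())
    return (n_k + 1) / (n + len(vocab))

def classify(words, priors, word_counts, vocab):
    """Pick the class maximising log P(v_j) + sum_i log P(a_i | v_j)."""
    scores = {}
    for label in priors:
        score = math.log(priors[label])
        for w in words:
            if w in vocab:  # words never seen in training are skipped
                score += math.log(word_prob(w, label, word_counts, vocab))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data, made up for illustration.
training = [
    (["wicket", "bowler", "run"], "Cricket"),
    (["pitch", "wicket", "over"], "Cricket"),
    (["racket", "serve", "set"], "Tennis"),
]
priors, word_counts, vocab = train(training)
print(classify(["wicket", "run"], priors, word_counts, vocab))  # Cricket
```

Although "wicket" never occurs in the Tennis examples, its estimated probability for that class is 1/11 rather than zero, so a single unseen word does not wipe out the whole product.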

III. The Proposed Method 

The proposed method to classify text is an
implementation of the Apriori algorithm. First, a large
number of data items are collected in electronic form. The
noise is then removed using data cleaning techniques. Next,
the Apriori algorithm is applied to find the
keywords of the data for all category-related topics, and
their probabilities are obtained using the Naive Bayes classifier.

A. The Proposed Algorithm 

The following algorithm is applicable at the class
determination stage of the testing phase. That is, after the
probability table as well as the association rules have been
created for the training data, text preparation for the test
data is done and then this algorithm is applied for the
classification.

The proposed algorithm works as follows:

1. Input: the set of input files, min_sup, min_confi

2. Pre-process each file

3. Obtain keywords

4. If count / no. of input files is greater than min_sup
then

5. Save the candidate sets and counts

6. Else discard

7. Calculate the probability for each set and each
class

8. End
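
Steps 4 to 6 of the listing above can be sketched in Python as follows. This is a simplified illustration that filters single keywords rather than candidate word sets; the function name `frequent_keywords` and the sample keyword sets are hypothetical.

```python
from collections import Counter

def frequent_keywords(keyword_sets, min_sup):
    """Keep keywords whose share of input files exceeds min_sup (steps 4 to 6)."""
    n_files = len(keyword_sets)
    # Count each keyword at most once per file.
    counts = Counter(word for kws in keyword_sets for word in set(kws))
    return {word: count for word, count in counts.items() if count / n_files > min_sup}

# Hypothetical keyword sets, one per pre-processed input file.
files = [
    {"cricket", "wicket", "run"},
    {"cricket", "wicket"},
    {"cricket", "ball"},
    {"tennis", "ball"},
]
kept = frequent_keywords(files, min_sup=0.55)
print(kept)  # {'cricket': 3}
```

Only "cricket" appears in more than 55% of the four files, so every other keyword is discarded.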

B. Flow Chart of the Technique 

Fig. 2 Flowchart of the Proposed Algorithm

IV. Experimental Evaluation 

A. Preparing Webpage for Classification 

Web pages from different sites and of different types
have been used in the experiment. Five different classes
are taken: Cricket, Hockey, Tennis, Football, and Baseball.
I collected a number of web pages related to each of
these classes.

To make the raw text valuable, that is, to prepare the text,
only the keywords are considered; unnecessary words
and symbols are removed. In this keyword extraction
process, common unnecessary words such as am,
is, are, to, from, etc. are dropped, along with all kinds of
punctuation and stop words. Singular and plural forms of a
word are considered the same. Finally, the remaining frequent
words are considered as keywords.

Consider the following web page:

Fig. 3 Example of Web page

After pre-processing the above text, the following
frequent words (keywords) were found:

{cricket, ground, match, ball, bowler, pitch, catch, over,
team, run, wicket}

B. Representing the Keywords in Binary Value

When the Apriori algorithm is executed, it generates the
associated word sets file, the configuration file, the
transaction file, and the probability file. The transaction
file is shown below. In this file, a value of 1 indicates the
presence of a keyword and a value of 0 indicates its absence.

Keywords Represented in Binary Value

In this work, a number of web pages from all five categories
are used as training data for learning to classify text.
For the 50% data set, 50 pages are from Cricket, 41 are from
Hockey, 46 are from Tennis, 43 are from Football, and 44 are
from Baseball. After pre-processing the text data, association
rule mining is applied to the set of transaction data, where
the frequent word set from each web page is considered as a
single transaction.
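
The binary transaction file described above can be sketched in Python as follows. The function name `to_binary_rows`, the shortened keyword list, and the two sample pages are hypothetical and only illustrate the 1/0 encoding.

```python
def to_binary_rows(keywords, pages):
    """Return one 0/1 row per page: 1 if the keyword is present, 0 if it is absent."""
    return [[1 if k in page else 0 for k in keywords] for page in pages]

keywords = ["cricket", "ground", "match", "ball", "wicket"]
# Hypothetical keyword sets for two pages, made up for illustration.
pages = [{"cricket", "match", "wicket"}, {"ground", "ball"}]
for row in to_binary_rows(keywords, pages):
    print(row)
# [1, 0, 1, 0, 1]
# [0, 1, 0, 1, 0]
```

Each row then serves as one transaction for association rule mining.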



C. Deriving Associated Word sets 

Each web page is considered as a transaction in the text
data. After pre-processing the text data, association rule
mining [5] [6] is applied to the set of transaction data,
where the frequent word set from each web page is
considered as a single transaction. Using these
transactions, a list of maximum-length sets is generated by
applying the Apriori algorithm [5] [6] [8]. The support and
confidence are set to 0.55 and 0.75 respectively.



TABLE II

Word set with Occurrence Frequency for 50% data sets

Word sets Found | Number of Occurrence Documents: Cricket | Hockey | Tennis | Football | Baseball

D. Associated Word Set with Probability Value

To use the Naive Bayes classifier for probability
calculation, the generated associated word sets are required.
The calculation of the first term of Equation (3) is based on
the fraction of each target class in the training data. From
the word sets generated after applying association mining on
the 50% training data, the following information was found:

Total No. of Word Set = 15
Total No. of Word Set from Cricket = 3
Total No. of Word Set from Hockey = 1
Total No. of Word Set from Tennis = 3
Total No. of Word Set from Football = 1
Total No. of Word Set from Baseball = 7

The prior probabilities for Cricket, Hockey, Tennis, Football,
and Baseball are 0.2, 0.06, 0.2, 0.06, and 0.46
respectively (3/15, 1/15, 3/15, 1/15, and 7/15, truncated to
two decimal places). The probability of each word set is then
calculated according to Equation (4). The probability values
of the word sets are listed in Table III.
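
The prior probabilities can be reproduced from the word set counts listed above with a short Python sketch. The counts are those reported in the text; the variable names are illustrative.

```python
# Word set counts per class, as reported for the 50% training data.
word_set_counts = {"Cricket": 3, "Hockey": 1, "Tennis": 3, "Football": 1, "Baseball": 7}

total = sum(word_set_counts.values())  # 15 word sets in all
priors = {label: count / total for label, count in word_set_counts.items()}

for label, p in priors.items():
    print(f"{label}: {p:.4f}")
```

The printed values are rounded to four decimal places, so Hockey and Football appear as 0.0667 and Baseball as 0.4667.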



TABLE III

Word set with Probability Value for 50% data sets

Word sets Found | Probability: Cricket | Hockey | Tennis | Football | Baseball

TABLE IV

Accuracy Regarding Different Set of Test Data

Data set (Total Data set %, Total Amount of Data, and per-class amounts) | Accurate Amount of Data (Total Amount and per-class amounts) | Accuracy
E. Comparative Study

In this section, I present a comparison from different
points of view. Table V below describes the old proposed
method [5] and the new proposed method for different data set
percentages, followed by a plot of the same data.

TABLE V

Percentage of Accuracy Vs Percentage of Data sets

% of Data sets | % of Accuracy: Old Proposed Work | New Proposed Method

Fig. 4 % of Data sets Vs % of Accuracy



The experiment started with 10% of the
data sets, which showed unsatisfactory accuracy. The
data set was then increased to 20%, which showed an improvement
in accuracy. As the percentage of training data was increased
further, the accuracy became more desirable. I checked up to
55% training data. Considering overall accuracy, the 50% data
set, with 68% accuracy, was taken as the best.

V. Conclusion

This paper presented an efficient technique for
web page classification. The technique will be more
effective if the training set is chosen in such a way that it
generates more word sets. Though the experimental results are
quite encouraging, it would be better to repeat the work with
larger data sets and more classes. The existing technique
requires more or less data for training, as well as less
computational time.

VI. Future Work

In the training set, although all the web pages have
almost equal length, they have slightly different
numbers of frequent words after pre-processing. This should
be addressed in order to avoid null attribute values in any
transaction in the transaction database, since word sets
containing null values have no use in classification. The
number of different class values can be increased for
generating associated word sets. In future, different values
of support and confidence can be taken to obtain different
results for the classes.